Experiments in the Retrieval of Unsegmented Japanese Text at the NTCIR-2 Workshop

Author

  • Paul McNamee
Abstract

Our work with the Hopkins Automated Information Retriever for Combing Unstructured Text (HAIRCUT) system has made use of overlapping character n-grams in the indexing and retrieval of text. In previous experiments with Western European languages we showed that longer n-grams (e.g., n=6) can provide an effective form of alinguistic term normalization. We wanted to investigate whether these methods could be adapted to unsegmented languages such as Japanese. To that end we participated in the Japanese and English portion of the NTCIR-2 evaluation. This paper describes results in monolingual Japanese and English retrieval and in cross-language retrieval using each language as a source language for the other. We found that 6-grams performed comparably to words for English and that 2-grams and 3-grams performed equally well on Japanese text. Combining runs that used each tokenization method yielded only a marginal improvement over runs using a single approach. These two trends were consistent regardless of query length or source language.
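As an illustration of the overlapping character n-gram tokenization the abstract refers to, here is a minimal Python sketch. It is not HAIRCUT's implementation: the function name `char_ngrams` is hypothetical, and letting n-grams span word boundaries and returning a too-short string as a single token are assumptions made only for this example.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of `text`, in order.

    Assumptions for this sketch (not taken from the paper):
    n-grams may span spaces, and a non-empty string shorter
    than n is returned as a single token.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(text) < n:
        return [text] if text else []
    # Slide a window of width n one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# English text tokenized into 6-grams.
english_6grams = char_ngrams("retrieval of text", 6)

# Unsegmented Japanese text tokenized into 2-grams and 3-grams;
# no word segmenter is needed.
japanese_2grams = char_ngrams("情報検索", 2)
japanese_3grams = char_ngrams("情報検索", 3)
```

Because the window advances one character at a time, the same routine applies to English and to unsegmented Japanese; only the choice of n changes.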

Similar resources

Probabilistic Text Retrieval for NTCIR9 GeoTime

For the NTCIR-9 Workshop UC Berkeley participated only in the GeoTime track. For our initial experiments we used only the Logistic Regression ranking with blind feedback approach that we also used in NTCIR-8. We participated in both English and Japanese monolingual and bilingual search tasks. For all Japanese topics we preprocessed the text using the ChaSen morphological analyzer for term segme...

NTCIR Workshop: an Evaluation of Cross-Lingual Information Retrieval

This paper introduces the first NTCIR Workshop, held Aug. 30 to Sept. 1, 1999, which is the first evaluation workshop designed to enhance research in Japanese text retrieval and cross-lingual information retrieval. The test collection used in the Workshop consists of more than 330,000 English and Japanese documents. Twenty-three groups from four countries have conducted IR tasks and submitted the searc...

Berkeley at NTCIR-2: Chinese, Japanese, and English IR experiments

This paper reports on the work of the Berkeley group at the second NTCIR workshop on Japanese & English IR and Chinese IR. A number of runs were submitted on all subtasks in the two main tasks. Our main focus in the Japanese monolingual subtask was on comparing the retrieval effectiveness of different segmentation methods. The experimental results show the bigram indexing outperformed the word-base...

The NTCIR Workshop : the First Evaluation Workshop on Japanese Text Retrieval and Cross-Lingual Information Retrieval

This paper outlines the first NTCIR Workshop, which is the first evaluation workshop designed to enhance research in Japanese text retrieval and cross-lingual information retrieval. The test collection used in the Workshop consists of more than 330,000 documents, more than half of which are English-Japanese paired. Twenty-three groups from four countries have conducted IR tasks and s...

Preface of NTCIR-8

The NTCIR-8 Meeting is where the groups that actively participated in one or more tasks set by NTCIR-8 report the latest results obtained from the evaluation workshop. The NTCIR evaluation workshop series is designed to enhance research in information access technologies, including text retrieval, cross-language information access, question-answering, information extraction, text mining, etc....

Publication date: 2001